82 ◾ Bioinformatics
the number of insertions (I) or deletions (D) from the CIGAR strings in a BAM file. Since
the CIGAR field is the sixth column, we first use “samtools view -F 0x4 SRR769545_mem_sorted.bam”
to extract the mapped records. We then pipe that output with “|” to “cut -f 6” to isolate the
sixth column. The result is piped to “grep -P”, which keeps only the strings that contain “I”
or “D”, using the character class “[ID]” to match either character. Next, the “tr” command
deletes every character other than “I” and “D”. Finally, “wc -c” counts the remaining
characters. Each remaining character marks one insertion or deletion operation, which may
span several bases, so the result is the number of indel events rather than the number of
inserted or deleted bases:
samtools view \
    -F 0x4 SRR769545_mem_sorted.bam \
    | cut -f 6 \
    | grep -P '[ID]' \
    | tr -cd '[ID]' \
    | wc -c
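Because a single CIGAR operation such as “3I” covers three bases, it can be useful to report both the number of indel operations and the number of bases they span. The following is a minimal sketch, not part of the original workflow: the function name “indel_summary” is hypothetical, and the toy CIGAR strings stand in for the output of “samtools view ... | cut -f 6”.

```shell
# Hypothetical helper (not a samtools feature): reads CIGAR strings, one per
# line, on stdin and prints the number of I/D operations and the total number
# of bases those operations span.
indel_summary() {
    # grep -o prints each "<length>I" or "<length>D" token on its own line;
    # awk strips the trailing letter and sums the lengths.
    grep -oE '[0-9]+[ID]' \
    | awk '{ events++; bases += substr($0, 1, length($0) - 1) }
           END { printf "events=%d bases=%d\n", events, bases }'
}

# Toy input standing in for: samtools view -F 0x4 <file>.bam | cut -f 6
printf '%s\n' '50M2I48M' '30M1D70M' '100M' '10M3I20M5D67M' | indel_summary
```

On the toy input, the four operations (2I, 1D, 3I, 5D) span 11 bases in total. With real data, the “printf” line would be replaced by the “samtools view” and “cut” steps shown above.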
To count insertions and deletions separately, use the following, respectively:
samtools view \
    -F 0x4 SRR769545_mem_sorted.bam \
    | cut -f 6 \
    | grep -P 'I' \
    | tr -cd 'I' \
    | wc -c

samtools view \
    -F 0x4 SRR769545_mem_sorted.bam \
    | cut -f 6 \
    | grep -P 'D' \
    | tr -cd 'D' \
    | wc -c
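The two separate counts can also be obtained in one pass, so that the BAM file is decoded only once. The sketch below is one possible approach using “awk”; the function name “count_indels” is hypothetical, and the toy CIGAR strings again stand in for real “samtools view ... | cut -f 6” output.

```shell
# Hypothetical helper: reads CIGAR strings on stdin and prints the number of
# I and D operations separately, in a single pass over the input.
count_indels() {
    # gsub() returns how many matches it replaced on the current line,
    # which here equals the number of I (or D) characters on that line.
    awk '{ ins += gsub(/I/, ""); del += gsub(/D/, "") }
         END { printf "insertions=%d deletions=%d\n", ins, del }'
}

# Toy input standing in for: samtools view -F 0x4 <file>.bam | cut -f 6
printf '%s\n' '50M2I48M' '30M1D70M' '100M' '10M3I20M5D67M' | count_indels
```

As with the commands above, these are counts of operations, not of inserted or deleted bases.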
Refer to Table 2.3 for the different FLAG values and descriptions.
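To illustrate how the “-F 0x4” filter used above works, the sketch below applies the same bitwise test to a few FLAG values in plain shell arithmetic. The function name “is_mapped” is hypothetical and the FLAG values are arbitrary examples; “samtools view -F 0x4” excludes any record whose FLAG has the 0x4 (read unmapped) bit set.

```shell
# Hypothetical helper: succeeds if a record with this FLAG would pass
# "-F 0x4", that is, if the 0x4 (read unmapped) bit is NOT set.
is_mapped() {
    [ $(( $1 & 0x4 )) -eq 0 ]
}

# Arbitrary example FLAG values
for flag in 0 4 16 77 99; do
    if is_mapped "$flag"; then
        echo "$flag kept"
    else
        echo "$flag dropped"
    fi
done
```

Here 4 and 77 are dropped because both include the 0x4 bit, while 0, 16, and 99 are kept.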
2.4.1.6 Removing Duplicate Reads
Duplicate reads may arise during library construction, from PCR amplification (PCR
duplicates), or from a fault in the sequencer's optical sensor (optical duplicates). A large
number of duplicate reads originating from a single fragment may bias some applications,
such as RNA-Seq, in which read counts have a biological interpretation. The
“samtools rmdup” command removes potential duplicate reads from BAM/SAM files. If
multiple reads (or read pairs, for paired-end data) have identical coordinates, only the one
with the highest mapping quality is retained. By default, this command assumes
paired-end reads; the option “-s” is used if the reads are single end. Note that recent
SAMtools releases mark “rmdup” as obsolete and recommend “samtools markdup” instead.
samtools rmdup \
SRR769545_mem_sorted.bam \